Horizontal gene transfer (HGT) often leads to phylogenetic incongruence. When "duplicative HGT" introduces a second 
copy of a pre-existing gene, the two copies may then engage in gene conversion, leading to phylogenetically mosaic 
genes. When duplicative HGT is followed by differential gene conversion among descendant lineages, as under the DH-DC 
model, phylogenetic analysis is further complicated. To explore the effects of DH-DC on phylogeny reconstruction, we 
analyzed two sets of sequences: (1) an augmented set of plant mitochondrial atp1 sequences for which we recently 
published evidence of DH-DC; and (2) a set of simulated sequences for which we varied the extent of chimerism, the 
number of chimeric genes and nucleotide substitution rates. We show that the phylogenetic behavior of evolutionarily 
chimeric genes is highly volatile and depends on both the degree of chimerism and the number of differentially chimeric 
genes present in the analysis. Furthermore, we show that the presence of chimeric genes in gene trees can spuriously 
affect the phylogenetic position of purely native sequences, especially by attracting these sequences toward basal 
positions in trees. We propose the term "HGT turbulence" to describe these complex effects of evolutionarily chimeric 
genes on phylogenetic results. 



Introduction 

Horizontal gene transfer (HGT) is very 
common and of great importance in 
bacterial evolution 1,2 and is also relatively 
common in certain eukaryotic lineages. 3,4 
One consequence of HGT is incong- 
ruence between phylogenies reconstructed 
from genes with different histories of 
transfer, including of course no transfer 
at all. Finding and examining phylogenetically 
atypical genes has become a standard 
task in HGT studies. 5-7 In cases where 
an entire gene has been transferred, 
without the complication of recombina- 
tion/conversion with a native homolog, 
the donor lineage can in principle be 
identified as the nearest phylogenetic 
neighbor. 7-9 Gene conversion enters the 
picture either during the act of gene 
transfer, when transiently-present foreign 
DNA directly converts (replaces) part of a 
native locus 10,11 or, after duplicative HGT, 
via potentially ongoing gene conversion 



between co-existing native and foreign 
copies. 12 Furthermore, gene conversion 
can occur in either a continuous or 
discontinuous manner. 13,14 Overall, then, 
gene conversion can lead to a potentially 
complex and diverse set of patchwork 
recombinant sequences, especially if it 
occurs repeatedly, and differentially, over 
the course of speciation. 12 Each recom- 
binant gene, if analyzed as a whole, 
might or might not reflect the true 
evolutionary history of either or both 
parental sequences. When parental 
sequences contribute differentially to the 
number of informative characters in a 
recombinant sequence, this sequence will 
tend to resemble the parental sequence 
that contributes more informative char- 
acters (e.g., refs. 15 and 16), whereas 
when parental sequences contribute sim- 
ilar numbers of informative characters, 
the recombinant will potentially be quite 
different from both parental sequences, 
depending of course on the degree of 



divergence of the two parental sequences 
from each other. 17 
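
The point about unequal parental contributions can be illustrated with a minimal Python sketch. It counts the sites at which the two parents differ and records which parent the recombinant matches at each. Using parent-distinguishing sites as a stand-in for informative characters is a simplifying assumption of this sketch, and the function name and toy sequences are hypothetical.

```python
def parental_contributions(parent_a, parent_b, recombinant):
    """Count parent-distinguishing sites at which the recombinant matches each parent.

    All three sequences are assumed to be aligned and of equal length.
    Returns (matches_a, matches_b, matches_neither).
    """
    if not (len(parent_a) == len(parent_b) == len(recombinant)):
        raise ValueError("sequences must be aligned and of equal length")
    from_a = from_b = neither = 0
    for a, b, r in zip(parent_a, parent_b, recombinant):
        if a == b:
            continue  # site does not distinguish the parents
        if r == a:
            from_a += 1
        elif r == b:
            from_b += 1
        else:
            neither += 1  # e.g., a later substitution in the recombinant
    return from_a, from_b, neither


# Toy example: the parents differ at every site; the recombinant takes
# its first four sites from parent A and its last two from parent B.
counts = parental_contributions("AAAAAA", "CCCCCC", "AAAACC")
```

A recombinant with strongly unequal counts would be expected to resemble the majority parent, whereas roughly equal counts correspond to the case in which it may differ markedly from both.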

When properly recognized and dealt 
with, recombination poses few problems 
for phylogenetic analysis and interpreta- 
tion. In practice, however, recombination 
detection is challenging and often subject 
to failure. First, it is well established that 
recombination detection programs per- 
form poorly when sequence divergence 
is low. 18-21 Unfortunately, plant mitochondrial 
genomes, which collectively 
constitute a premier model system for 
eukaryotic HGT studies, 4,22 usually have 
very low rates of nucleotide substitution. 23 
Second, gene conversion often involves 
very short tracts of DNA, 13,14 which can 
make recombination detection very difficult. 
For instance, ten previously published 
recombinant regions between plant 
mitochondrial and chloroplast genes range 
in length from only 14 to 79 nucleotides. 24,25 
Third, existing recombination 
detection programs are generally designed 
to identify a single or only a small 
number of recombination breakpoints. 26,27 
Intricate gene conversion during the pro- 
cess of duplicative HGT and differential 
gene conversion (DH-DC) can, however, 
lead to mosaic gene structures, with 
multiple foreign regions interspersed with 
native regions on a fine scale. 

We recently reported such mosaicism 
in mitochondrial atp1 and matR genes 
belonging to different groups of flowering 
plants (angiosperms). We showed that these 
mosaic genes largely escaped detection by 
recombination-detection programs and 
were recognizable only by direct visual 
inspection of DNA sequence alignments. 12 
In this report, we explore the effects of 
chimeric sequences on phylogeny recon- 
struction by conducting phylogenetic 
analyses on simulated sequences and on 
an augmented set of the atp1 sequences 
analyzed in reference 12. 

Results 

Phylogenetic analysis of naturally-occur- 
ring mosaic genes. In a recent study, 12 we 
reported the presence of three differentially 
mosaic types of mitochondrial atp1 genes 
in the angiosperm genus Ternstroemia 
(Pentaphylaceae, Ericales) and concluded 
that they arose via DH-DC, with the 
blueberry genus Vaccinium (Ericaceae) the 
best candidate to be the donor group 
in the initiating HGT event. As shown 
in Figure 1 (adapted from Fig. 2A of 
ref. 12), each of the three major clades 



within Ternstroemia possesses a differentially 
mosaic atp1 gene, each with multiple 
(four to five) foreign regions interspersed 
with native regions. In the one atp1 
phylogeny presented in reference 12, the 
mosaic Ternstroemia genes were all placed 
within the Ericaceae, in a paraphyletic 
relationship with respect to Vaccinium. 

To better understand how mosaic 
genes affect phylogeny reconstruction, we 
sequenced the mitochondrial atp1 gene 
of Chamaedaphne calyculata, a close relative 
of Vaccinium, 28 using the same set of 
primers and method as in reference 12, 
and employed it together with varying 
subsets of previously sequenced Ericales 
atp1 genes in maximum likelihood phylogenetic 
analyses. The Chamaedaphne atp1 
sequence was deposited in GenBank under 
accession number JN808446. Consistent 
with our recent study, 12 and in contrast 
to organismal phylogeny (Fig. 2A), in an 
analysis that included all relevant genes, 
the three types of mosaic atp1 genes 
in Ternstroemia formed a paraphyletic 
assemblage, with T. fragrans sister to the 
Vaccinium/Chamaedaphne clade, the T-glj 
clade the most distant from the Vaccinium/Chamaedaphne 
clade and the T-ips clade 
in an intermediate position (Fig. 2B). 
Remarkably, when only one type of 
mosaic atp1 gene was included in a given 
analysis, each of the three types fell 
in a different phylogenetic position 
(Fig. 2C-E). This shows that the phylo- 
genetic position of a mosaic gene can vary 
depending on the inclusion of additional, 



related mosaic genes and emphasizes that 
this position provides little or no reliable 
information on the nature of the gene's 
parental sequences. 

The only other topological difference 
among all five atp1 trees (Fig. 2B-F) 
involves Chamaedaphne, which was 
weakly placed (41% bootstrap support) 
within Vaccinium when all mosaic genes 
were included (Fig. 2B), but was placed as 
sister (with 68-92% support) to a mono- 
phyletic Vaccinium in the other four gene 
trees. This result raises the possibility that 
the inclusion of mosaic sequences in 
phylogenetic analyses can affect not only 
the placement of related mosaic sequences, 
but also the placement of apparently native 
sequences. 

Phylogenetic analysis of simulated 
chimeric sequences. We used simulation 
studies (conducted using Seq-Gen 29) to 
further explore the effects of chimeric 
sequences on phylogeny reconstruction. 
The following simulation parameters 
were chosen to be the same as those 
used in the above analysis of atp1 
sequences: (1) sequence length, 1200 
nucleotides; (2) substitution model, GTR; 
(3) gamma shape parameter, 0.218; (4) proportion 
of invariant sites, 0.371; (5) nucleotide 
frequencies, 0.271 (A), 0.207 (C), 
0.261 (G) and 0.261 (T); and (6) GTR 
relative rate parameters, A<->C = 0.818, 
A<->G = 1.938, A<->T = 0.244, 
C<->G = 0.884, C<->T = 2.219 
and G<->T = 1.000. All but the first 
of these parameters are based on PhyML 30 




Figure 1. Three types of mosaic mitochondrial atp1 genes in Ternstroemia (adapted from ref. 12). The multi-colored boxes represent atp1 genes of the 
three subclades within Ternstroemia. Black vertical lines represent the 38 nucleotide positions inferred 12 to have differed between donor and recipient 
atp1 genes at the time of atp1 transfer from Vaccinium to a common ancestor of Ternstroemia. Lines at the top of the boxes and red shading indicate 
sites and regions, respectively, of putatively foreign, Vaccinium ancestry, while bottom lines and blue shading represent native sites and regions. White 
lines centered within the boxes represent the only two sites that otherwise differ within the Ternstroemia clade. "T-ips" refers to the Ternstroemia 
subclade containing T. impressa, T. peduncularis and T. stahlii, and "T-glj" to the subclade containing T. gymnanthera, T. longipes and T. japonica (see also 
Fig. 2A). 
Figure 2. Phylogenetic analysis of mosaic mitochondrial atp1 genes. (A) Chronogram showing organismal relationships and divergence times of relevant 
taxa belonging to the Ericales. As described in reference 12, the chronogram was constructed by the BEAST program 43 using a Eurya reference fossil 
calibration of 86 Myr ago. 44 (B-F) Maximum likelihood phylogenies of mitochondrial atp1 genes from the taxa shown in (A), with these analyses varying as 
to which members of Ternstroemia (shown in red), whose atp1 gene is differentially mosaic, were included. RAxML 32 version 7.0.4 was used to construct 
all phylogenies with a GTR+Γ+I substitution model. A total of 1000 bootstrap iterations were performed, with all bootstrap values > 50% shown 
on the trees. Phylogenies were rooted using Fouquieria, Marcgravia and Pentamerista as unshown outgroups (hence the stub branch at the base 
of each gene tree). 



estimates derived from the mitochondrial 
atp1 alignment shown in Figure S1, except 
with the mosaic Ternstroemia genes 
excluded (as in Fig. 2F) because recombinant 
genes have been shown to alter the 
estimation of substitution rate heterogeneity. 31 
The remaining simulation parameters 
were independent of the atp1 data and 
include the number of sequences, their 
topology, relative branch lengths and absolute 
amount of divergence (Fig. 3A). For 
simplicity, only one recombination breakpoint 
was allowed in each chimeric 
sequence, with these chimerics constructed 
to contain varying proportions (from 50:50 
to 10:90) of two of the 16 "parental" 
sequences generated by the simulations. 
The 16 purely-simulated sequences together 
with the artificially constructed chimeric 
sequence were used in phylogenetic analyses 
performed using RAxML version 7.0.4. 32 
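
The single-breakpoint construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, the two "parental" sequences are uniform random placeholders rather than Seq-Gen output under the GTR model, and the orientation (5' portion from sequence 1, 3' portion from sequence 9) follows the description in the Figure 3 legend.

```python
import random


def make_chimera(five_prime_parent, three_prime_parent, three_prime_percent):
    """Build a single-breakpoint chimera from two aligned parental sequences.

    The last `three_prime_percent` percent of the result comes from
    `three_prime_parent`; everything upstream comes from `five_prime_parent`.
    """
    if len(five_prime_parent) != len(three_prime_parent):
        raise ValueError("parents must be aligned and of equal length")
    if not 0 <= three_prime_percent <= 100:
        raise ValueError("three_prime_percent must be between 0 and 100")
    length = len(five_prime_parent)
    cut = length * (100 - three_prime_percent) // 100  # breakpoint position
    return five_prime_parent[:cut] + three_prime_parent[cut:]


# Placeholder parents of 1200 nucleotides (NOT simulated under GTR).
rng = random.Random(0)
seq1 = "".join(rng.choices("ACGT", k=1200))
seq9 = "".join(rng.choices("ACGT", k=1200))

# "10:90" means 10% of sequence 9 and 90% of sequence 1, and so on.
chimeras = {f"{p}:{100 - p}": make_chimera(seq1, seq9, p) for p in (10, 30, 50)}
```

With 1200-nucleotide parents, the breakpoints fall at positions 1080, 840 and 600 for the 10:90, 30:70 and 50:50 chimeras, respectively.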

Two sets of simulation analyses were 
performed. In the first set (Fig. 3), we 



varied the parental proportions that comprise 
the chimeric sequence(s) and the 
number of chimeric sequences included in 
a given analysis. In analyses with a single 
chimeric sequence, this sequence grouped 
with 100% bootstrap support with its 
majority parental sequence when the 
parental ratio was 10:90 (i.e., when the 
chimeric sequence consisted of 10% 
of sequence 9 and 90% of sequence 1; 
Fig. 3B). The same topology was obtained 
with the 30:70 chimera (Fig. 3C), but the 
bootstrap value dropped to 92. When the 
chimera was 50:50 (Fig. 3D), it went to 
the base of the tree, in between the two 
main clades of simulated sequences, and 
with bootstrap support reduced along 
the branches leading to both parental 
sequences. These results thus show that 
the phylogenetic position of chimeric 
genes can vary substantially as the propor- 
tion of parental sequences that comprise 
the chimerics varies. 



The phylogenetic position of the 50:50 
chimeric sequence also varied substanti- 
ally depending on whether it was the 
only chimeric sequence in the analysis 
(Fig. 3D) or whether the 30:70 and/or 
10:90 sequences were also included 
(Fig. 3E-G). Inclusion of the 30:70 
sequence, either with (Fig. 3E) or without 
(Fig. 3F) the 10:90, "pulled" the 50:50 
from the base of the tree to its periphery, 
together with the 30:70 and parental 
sequence 1 (and with the 10:90 when 
included) and with strong support (96% 
and 92%, respectively). This peripheral 
attraction was more subdued when the 
50:50 was paired with the 10:90 (Fig. 3G) 
as opposed to the 30:70 (Fig. 3F), pre- 
sumably because of the greater proportion 
of sequence length shared by the 30:70 
and 50:50 (80%) relative to the 10:90 and 
50:50 (60%). Also, there is evidence for a 
mutual attraction between the 10:90 and 
50:50 (but not between the 30:70 and 
Figure 3. Variable phylogenetic placement of chimeric genes demonstrated by simulations. Artificial sequences were simulated as described in the text. 
Chimeric sequences were generated by combining the 5' and 3' portions of sequences 1 and 9, respectively (both circled in red), with different length 
ratios (10:90, 30:70 or 50:50) of the two parental sequences. One thousand tree-building iterations were performed. The tree shown in each panel is 
based (for computational ease) on the concatenated sequences of the first 100 iterations, while the bootstrap support values are from all 1000 trees. 
All bootstrap values are > 95% except for those shown. 



50:50), in that the bootstrap support for 
the 10:90 grouping with parental sequence 
1 was reduced from 100% (Fig. 3B) to 
only 54% (Fig. 3G). These simulation 
results are consistent with the empirical 
results (Fig. 2) in showing that the 
phylogenetic position of a chimeric gene 
can vary substantially when different 
additional chimeric genes are sampled. 
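
The 80% and 60% figures quoted above for the proportion of sequence length shared between pairs of chimeras follow directly from the breakpoint positions. The Python sketch below reproduces them, assuming (as in the Figure 3 legend) that every chimera takes its 5' portion from sequence 1 and its 3' portion from sequence 9; the function names are hypothetical.

```python
def ancestry(length, three_prime_percent):
    """Per-site parental source for a single-breakpoint chimera.

    Sites upstream of the breakpoint are labeled "1" (sequence 1) and
    sites downstream are labeled "9" (sequence 9).
    """
    cut = length * (100 - three_prime_percent) // 100
    return ["1"] * cut + ["9"] * (length - cut)


def shared_percent(length, percent_x, percent_y):
    """Percentage of sites at which two chimeras derive from the same parent."""
    map_x = ancestry(length, percent_x)
    map_y = ancestry(length, percent_y)
    same = sum(x == y for x, y in zip(map_x, map_y))
    return 100 * same / length


share_30_50 = shared_percent(1200, 30, 50)  # 30:70 versus 50:50
share_10_50 = shared_percent(1200, 10, 50)  # 10:90 versus 50:50
```

The 30:70 and 50:50 chimeras share the first 600 sites (sequence 1) and the last 360 sites (sequence 9), whereas the 10:90 and 50:50 chimeras share the first 600 and only the last 120.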

We also found that chimeric genes can 
perturb phylogenetic analysis by pro- 
ducing branch length bias. First, certain 
chimeric genes were directly associated 
with elongated branch lengths. In the 
single-chimeric analyses, the 10:90 chi- 
mera had a notably longer branch length 
than sequence 1 (Fig. 3B), while the 30:70 
branch length was even longer (Fig. 3C), 
with the (homoplasious) substitutions 
contributed by the 30% of the chimeric 
sequence originating from sequence 9 
presumably responsible for this greatly 
extended branch length. Similar results 
were obtained in Figure 3E-H, where 
multiple chimeric sequences were included 
in each analysis. Second, the presence of 



the 50:50 chimeric in the single-chimeric 
analysis of Figure 3D resulted in the loss 
of a molecular clock among the non- 
recombinant sequences in this analysis, 
with the branch leading to the (1-4) clade 
only 60% of the length leading to the 
(5-8) clade and the branch leading to the 
(9-12) clade only 58% of the length 
leading to the (13-16) clade. There are 
consistently two additional — albeit much 
less pronounced — sets of branch length 
differences in the six trees (Fig. 3B, C and 
E-H) in which one or more chimeric 
sequences are sister to sequence 1. 

The second set of simulations showed 
that chimeric sequences can also confound 
phylogenetic analysis by altering the 
placement of non-recombinant (native) 
sequences. In these simulations (Fig. 4), 
we varied the absolute amount of sequence 
divergence across the tree, with the 50:50 
sequence the only chimeric sequence in 
each analysis. Figure 4A shows the same 
tree as Figure 3D, while Figure 4B-D 
has expanded divergence by a factor of 
2×, 5× and 10×, respectively. As the 



simulated sequences become more diver- 
gent, the parental sequences show an 
increased and pronounced tendency to 
be attracted toward the base of the tree 
by the chimeric gene. Also, the branch 
leading to the chimeric sequence becomes 
increasingly short, approaching zero in 
Figure 4C and D. It is important to note 
that this effect is not simply the result 
of there being increasingly more informative 
characters from Figure 4A-D. In 
analyses with 5× or 10× divergence 
(data not shown), but only 1/10th the 
sequence length (120 nucleotides rather 
than the 1200 in the simulations shown 
in Figure 4), the topology and bootstrap 
values were essentially identical with those 
in Figure 4C and D, respectively, with 
these being substantially different from 
those in Figure 4A and B. The different 
topologies shown in Figure 4 must there- 
fore result from deterministic effects arising 
from the varying levels of divergence in 
these trees. These simulations thus show 
that inclusion of chimeric sequences can dis- 
tort the branching pattern of nonchimeric, 
Figure 4. Chimeric sequences increasingly disrupt native-sequence topologies as divergence increases. (A) The same tree as shown in Figure 3D. 
(B-D) Maximum likelihood analysis of the same simulated sequences, but with increasing branch-length scales across the set of sequences: 
(B) 2× the scale in (A); (C) 5×; and (D) 10×. One thousand tree-building iterations were performed. The tree shown in each panel is based 
(for computational ease) on the concatenated sequences of the first 100 iterations, while the bootstrap support values are from all 1,000 trees. 
All bootstrap values are > 95% except for those shown. 



native sequences and that the extent of 
this distortion varies directly with the 
absolute level of sequence divergence. 
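
The divergence scaling used in this second set of simulations amounts to multiplying every branch length of the guide tree by a constant before sequences are simulated. The text does not state how this was done in practice, so the following Python sketch is only one possible way to do it; the function name and the four-taxon Newick tree are hypothetical and far simpler than the 16-taxon tree of Figure 3A.

```python
import re

# Matches a branch length that follows a colon in a Newick string.
_BRANCH_LENGTH = re.compile(r":(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def scale_branch_lengths(newick, factor):
    """Return a copy of a Newick tree with every branch length multiplied by `factor`."""
    def rescale(match):
        return ":" + format(float(match.group(1)) * factor, ".6g")
    return _BRANCH_LENGTH.sub(rescale, newick)


# Hypothetical toy guide tree with two clades of two sequences each.
toy_tree = "((s1:0.01,s2:0.01):0.02,(s9:0.01,s10:0.01):0.02);"

# Guide trees at 1x, 2x, 5x and 10x the original divergence.
scaled = {factor: scale_branch_lengths(toy_tree, factor) for factor in (1, 2, 5, 10)}
```

Because only the branch lengths change, the topology and relative branch lengths stay fixed, which is what allows the differences among the panels of Figure 4 to be attributed to the absolute level of divergence.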

Discussion 

The analyses reported in this study show 
that the inclusion in phylogenetic analyses 
of chimeric sequences arising from HGT 
and gene conversion can produce a variety 
of spurious phylogenetic results. These 
include the misplacement of both chimeric 
and native sequences, as well as branch 
length distortions. Accordingly, we intro- 
duce the term "HGT turbulence" as a 
general moniker for this category of phy- 
logenetic artifacts. Mutually reinforcing 
evidence for these types of HGT turbulence 
was apparent in the phylogenetic analyses of 
both simulated sequences and a naturally 
occurring set of native and chimeric 
mitochondrial sequences, with the simula- 
tions providing greater opportunity to 
crystallize and illustrate specific sets of 
sequence interactions and consequences. It 
is also important to realize that while these 
simulations were framed and presented in 
the context of HGT, their results apply 
equally well to conversion between paralogs 
arising from internal gene duplication as to 
xenologs arising from HGT. 

HGT turbulence is probably relatively 
common in bacteria, 1,2 given the prevalence 
of HGT and recombination during bacterial 
genome evolution. For example, such 
diverse bacteria as Neisseria meningitidis, 33,34 
Streptococcus pneumoniae, 35,36 Helicobacter 
pylori 37 and Wolbachia 38 have been found 
to be so recombinogenic that scientists have 
resorted to using multiple loci (e.g., multi- 
locus sequence typing 39 ) as opposed to a 
single locus to identify clones. Surprisingly, 
however, the phenomenon of HGT tur- 
bulence has never been explicitly addressed 
in the bacterial literature. One reason for 
this is that these studies have mainly focused 
on minimizing the effect of recombination 
and thereby inferring accurate evolutionary 
relationships, or on quantifying the number 
of recombinant genes, rather than actually 
exploring the topological alterations caused 
by HGT turbulence. Also, the precise 
origins of horizontally transferred genes (or 
gene fragments) in bacterial genomes can 
be extremely difficult to recover, especially 
when transfer and/or subsequent recom- 
bination have occurred on a fine scale. 

Several other studies 17,31,40-42 have used 
recombinant sequences in phylogenetic 
simulations, but most of these have 
focused on the issue of whether recombi- 
nants are detectable and/or how to detect 
them. Perhaps most relevant to our 
study is the 2002 study by Posada and 
Crandall, 17 which also explored the rela- 
tionship between the location of recombi- 
nation breakpoints and the phylogenetic 
placement of recombinant sequences, 
reaching a similar conclusion to ours on this 
point. However, none of these studies 
explored the effects of combining multiple, 
related chimeric sequences in the same 



analysis, nor did they show that chimeric 
sequences can, under certain conditions, 
substantially alter the phylogenetic behavior 
of native sequences. Further simulation 
studies on the effects of recombination on 
phylogenetic inference would therefore 
appear to be called for. 

Recognition of the remarkable frequency 
and extent of horizontal gene transfer, and its 
often great evolutionary importance, is 
arguably the greatest accomplishment of 
the past 15 years of comparative genomics 
research. Because HGT is so common and 
important, recognizing and properly dealing 
with HGT turbulence is likewise important. 
This is so not only because of the obvious 
need for obtaining accurate estimates of gene 
and species phylogeny, but also because 
otherwise too many cases of chimeric HGT, 
including complex situations involving 
DH-DC, will continue to go overlooked. 